cdcevent: only fetch relevant columns in the rowfetcher#89329
cdcevent: only fetch relevant columns in the rowfetcher#89329craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
dd08b3d to
4f77618
Compare
|
Starting with this small PR to work towards #80976. Next steps (https://github.com/jayshrivastava/cockroach/commits/rowfetcher-2):
|
4f77618 to
fb45570
Compare
HonoreDB
left a comment
There was a problem hiding this comment.
Looks great! A couple of minor line comments. Also, please add a few tests in changefeed_test.go with tables with multiple column families and undergoing schema changes in ways that could change column ordering.
Reviewed 2 of 3 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
pkg/ccl/changefeedccl/cdcevent/event.go line 284 at r1 (raw file):
addColumn(col, last) sd.valueCols = append(sd.valueCols, last) last++
Maybe instead of giving these real-looking ordinals, they should have an ordinal of 2^32-1 which you can check for explicitly instead of looking for out-of-bounds indices.
pkg/ccl/changefeedccl/cdcevent/event.go line 663 at r1 (raw file):
columns := tableDesc.GetPrimaryIndex().CollectKeyColumnIDs() for _, colID := range familyDesc.ColumnIDs { col, err := tableDesc.FindColumnWithID(colID)
Nit: FindColumnWithID is linear in the number of columns, so this method is O(n^2). Probably better to build another column set from tableDesc.PublicColumnIDs, intersect it with familyDesc.ColumnIDs, and then union the result with the key column ids.
fb45570 to
9d4d5c5
Compare
Previously, a rowfetcher would be configured to fetch all public columns in a table, including virtual columns. Then, if we only wanted to read columns belonging to a particular family, we would index into this row. This change updates row fetchers to only fetch relevant data, including only the primary key columns and columns in a specified family. This change also updates the row fetcher such that it does not fetch virtual columns. Instead, they are added on by the event descriptor. If the descriptor specifies a column outside of the physical row, the row iterator will fill in a null value upon iteration, representing a virtual column. Release note (enterprise change): When a changefeed is run with the option virtual_columns = "null", the virtual column will be ordered last in each row.
9d4d5c5 to
8029ba7
Compare
There was a problem hiding this comment.
👍 edit: sorry I accidentally messed up my code quotes in reviewable.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB)
pkg/ccl/changefeedccl/cdcevent/event.go line 663 at r1 (raw file):
Previously, HonoreDB (Aaron Zinger) wrote…
Nit:
FindColumnWithIDis linear in the number of columns, so this method is O(n^2). Probably better to build another column set from tableDesc.PublicColumnIDs, intersect it with familyDesc.ColumnIDs, and then union the result with the key column ids.
I considered doing these set operations, but I realized it sometimes returns the wrong result.
Firstly, during a schema change, it's possible for familyDesc to have a column which is not public. Since the familyDesc does not have information about public columns, you have to check the tableDesc.
Secondly, when you change the type of a column (ie. change its position), calling tableDesc.PublicColumnIDs still keeps the column in its original position (although with a different ID, sorting by PGAttributeNum internally). If we use tableColSet.Ordered, it will sort by columnID.
To preserve correctness during schema changes and also the behavior of what happens in master (which uses tableDesc.PublicColumnIDs), I think what I have now is the best way to go forward.
pkg/ccl/changefeedccl/cdcevent/event.go line 133 at r2 (raw file):
// forEachColumn is a helper which invokes fn for reach column in the ordColumn list. func (r Row) forEachDatum(fn DatumFn, colIndexes []int) error {
forEachDatum and NewEventDescriptor now use virtualColOrd to put virtual cols in the correct spot instead of sticking them on the end of the row. This is closer to the behavior of master.
Code quote:
forEachDatumpkg/ccl/changefeedccl/cdcevent/event_test.go line 393 at r2 (raw file):
"INSERT INTO foo (i,j,a,b) VALUES (1,2,'a1','b1')", }, expectMainFamily: []decodeExpectation{
This behavior aligns with master. My previous code did not.
HonoreDB
left a comment
There was a problem hiding this comment.
Reviewed 1 of 3 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained
|
TYFR! |
|
bors r+ |
|
Build failed (retrying...): |
|
Build succeeded: |
Previously, a rowfetcher would be configured to fetch all public columns in a
table, including virtual columns. Then, if we only wanted to read columns
belonging to a particular family, we would index into this row. This change
updates row fetchers to only fetch relevant data, including only the primary
key columns and columns in a specified family.
This change also updates the row fetcher such that it does not fetch
virtual columns. Instead, they are added on by the event descriptor.
If the descriptor specifies a column outside of the physical row,
the row iterator will fill in a null value upon iteration, representing
a virtual column.
Release note (enterprise change): When a changefeed is run with the option
virtual_columns = "null", the virtual column will be ordered last in each
row.
Informs: #80976
Epic: none